STA2023 Review:
Point Estimation
Introduction: Topics
- Basic descriptives:
- Continuous variables:
- Mean
- Median
- Variance and standard deviation
- Range and interquartile range
- Categorical variables:
- Count
- Overall percentage
- Row percentage
- Column percentage
Introduction: Data
- Name: the pony’s name
- Type: type of pony (Earth, Pegasus, Unicorn, Alicorn)
- Sex: sex/age of pony (Colt, Filly, Stallion, Mare)
- Flying speed: average flying speed (km/hr) for winged ponies
- Friendship: a harmony index from friendship activities (0-10)
- Magical energy: measured magical energy output (sparkles) for magical ponies
- Tail shimmer: how much light is reflected by the pony’s tail (lux)
Types of Variables: Qualitative
- A qualitative or categorical variable classifies an observation into one of two or more groups or categories.
- Nominal: purely qualitative and unordered
- Ordinal: data can be ranked, but intervals between ranks may not be equivalent
- Examples:
- satisfaction rating
- favorite color
- type of pet
- education level
- blood type
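The nominal/ordinal distinction can be sketched in R with factors. This is only an illustration: the responses below are made up, and the variable names `favorite_color` and `satisfaction` are hypothetical.

```r
# HYPOTHETICAL responses, made up purely for illustration
favorite_color <- factor(c("red", "blue", "red", "green"))  # nominal: no order

satisfaction <- factor(
  c("low", "high", "medium", "high"),
  levels = c("low", "medium", "high"),  # the ranking we assume for the categories
  ordered = TRUE                        # ordinal: categories can be ranked
)

# Only the ordered factor knows about a ranking
print(is.ordered(favorite_color))
print(is.ordered(satisfaction))

# Comparisons are meaningful for ordinal data ("low" ranks below "high")
print(satisfaction[1] < satisfaction[2])

# Counts per category work for both kinds of qualitative variable
print(table(favorite_color))
print(table(satisfaction))
```

An ordered factor records the ranking only; it says nothing about whether the gaps between ranks are equal.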
Types of Variables: Quantitative
- A quantitative or continuous variable takes numerical values for which arithmetic operations such as adding and averaging make sense; typically has a unit of measure.
- Interval: meaningful differences between values, but no true zero point
- Ratio: meaningful differences and a true zero point
- Examples:
- age (years)
- temperature (Celsius)
- daily hours of sleep
- ACT or SAT score
- height (inches)
Types of Variables: Example
- Name: the pony’s name
- Type: type of pony (Earth, Pegasus, Unicorn, Alicorn)
- Sex: sex/age of pony (Colt, Filly, Stallion, Mare)
- Flying speed: average flying speed (km/hr) for winged ponies
- Friendship: a harmony index from friendship activities (0-10)
- Magical energy: measured magical energy output (sparkles) for magical ponies
- Tail shimmer: how much light is reflected by the pony’s tail (lux)
Describing Data: Why?
- Why do we describe data? We want to tell a story!
- Summarize n observations into a single description
- Understand what is in the data
- Spot patterns, missing data, or outliers
- Compare groups or spot differences or oddities
Describing Data: How?
- How do we describe data?
- Numbers
- Frequency table
- Mean & standard deviation
- Median & IQR
- Graphs
- Bar charts
- Box plots
- Histograms
Point Estimation: Mean
- Mean: the average of a set of values
\bar{y} = \frac{\sum_{i=1}^n y_i}{n}
- Find the mean for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
\bar{y} = \frac{\sum_{i=1}^n y_i}{n} = \frac{10 + 20 + 30 + 40 + 100}{5} = 40
- The average flying speed of these 5 ponies is 40 km/hr.
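As a quick check of the arithmetic above, here is a minimal base R sketch. The vector name `speed` is our own choice; the median is included because it is listed in the topics and is used later for the IQR.

```r
# Flying speeds (km/hr) of the 5 ponies from the example
speed <- c(10, 20, 30, 40, 100)

# Mean "by hand": sum of the values divided by the number of values
mean_by_hand <- sum(speed) / length(speed)

# Built-in equivalent
mean_builtin <- mean(speed)

# Median: the middle value of the sorted data
median_speed <- median(speed)

print(mean_by_hand)
print(mean_builtin)
print(median_speed)
```

For this sample the mean (40) sits above the median (30) because the single large value, 100, pulls the mean upward.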
Point Estimation: Variance
- Variance: A measure of spread; approximately the average of the squared differences from the mean (the sum is divided by n - 1 rather than n).
- Higher variance = data has more spread.
- In squared units of the data.
s_y^2 = \frac{\sum_iy_i^2 - (\sum_iy_i)^2/n}{n-1}
- Find the variance for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
s_y^2 = \frac{\sum_iy_i^2 - (\sum_iy_i)^2/n}{n-1} = \frac{(10^2+...+100^2)-(10+...+100)^2/5}{4} = 1250
- The variance is 1250 (km/hr)^2.
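The elided sums in the calculation can be filled in with a short base R sketch. It compares the shortcut formula from the slide with the definitional form (squared deviations from the mean) and with the built-in `var()`; the object names are our own.

```r
# Flying speeds (km/hr) of the 5 ponies from the example
speed <- c(10, 20, 30, 40, 100)
n <- length(speed)

# Shortcut formula from the slide:
# (sum of squares - (sum)^2 / n) / (n - 1) = (13000 - 8000) / 4
var_shortcut <- (sum(speed^2) - sum(speed)^2 / n) / (n - 1)

# Definitional form: sum of squared deviations from the mean, over n - 1
var_definition <- sum((speed - mean(speed))^2) / (n - 1)

# Built-in sample variance (also divides by n - 1)
var_builtin <- var(speed)

print(c(var_shortcut, var_definition, var_builtin))
```

All three agree for this sample, so the shortcut is a convenience for hand calculation rather than a different quantity.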
Point Estimation: Standard Deviation
- Standard Deviation: A measure of spread; roughly the typical distance of an observation from the mean.
- Higher standard deviation = data has more spread.
- Same units as the data.
s_y = \sqrt{s_y^2}
- Find the standard deviation for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
s_y = \sqrt{s^2_y} = \sqrt{1250} \approx 35.36
- The standard deviation is 35.36 km/hr.
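A minimal base R sketch of the same calculation follows. It also computes the literal average absolute distance from the mean, to show that the standard deviation is a related but different summary; the object names are our own.

```r
# Flying speeds (km/hr) of the 5 ponies from the example
speed <- c(10, 20, 30, 40, 100)

# Standard deviation as the square root of the sample variance
sd_from_variance <- sqrt(var(speed))

# Built-in equivalent
sd_builtin <- sd(speed)

# Literal average absolute distance from the mean, for comparison
mean_abs_dev <- mean(abs(speed - mean(speed)))

print(round(sd_builtin, 2))
print(mean_abs_dev)
```

Squaring gives large deviations extra weight, which is why the standard deviation here is larger than the average absolute distance.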
Point Estimation: Range
- Range: difference between the maximum and minimum values
\text{range} = \text{max}(y) - \text{min}(y)
- Find the range for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
\begin{align*}
\text{range} = \text{max}(y) - \text{min}(y) = 100 - 10 = 90
\end{align*}
- The range of the flying speeds is 90 km/hr.
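In R, one detail worth knowing is that `range()` returns the minimum and maximum rather than their difference. A minimal sketch, with object names of our own choosing:

```r
# Flying speeds (km/hr) of the 5 ponies from the example
speed <- c(10, 20, 30, 40, 100)

# range() returns both endpoints: c(min, max)
endpoints <- range(speed)

# The range as defined on the slide is the difference between them
range_by_hand <- max(speed) - min(speed)
range_from_endpoints <- diff(endpoints)

print(endpoints)
print(range_by_hand)
```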
Point Estimation: Interquartile Range
- Interquartile Range (IQR): range of the middle 50% of the data.
\text{IQR} = \text{P}_{75} - \text{P}_{25}
- Find the IQR for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
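- The median (the middle value of the sorted data) is 30.
- We then find P_{25} as the median of the lower half, {10, 20}, and P_{75} as the median of the upper half, {40, 100}.
- Thus, P_{25} = (10 + 20)/2 = 15 and P_{75} = (40 + 100)/2 = 70.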
\begin{align*}
\text{IQR} = \text{P}_{75} - \text{P}_{25} = 70 - 15 = 55
\end{align*}
- The IQR of the flying speeds is 55 km/hr.
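The hand method can be sketched in base R as below. Software may use a different percentile rule: R's `quantile()` and `IQR()` default to `type = 7`, which gives a different answer for a sample this small, while `type = 6` happens to match the hand calculation here. The split into halves assumes an odd sample size with no values tied with the median.

```r
# Flying speeds (km/hr) of the 5 ponies from the example, sorted
speed <- sort(c(10, 20, 30, 40, 100))

# Hand method from the slide: medians of the halves below and above the median
# (ASSUMPTION: odd n and no values tied with the median)
med <- median(speed)
p25 <- median(speed[speed < med])   # median of {10, 20}
p75 <- median(speed[speed > med])   # median of {40, 100}
iqr_by_hand <- p75 - p25

# R's default percentile rule (type = 7) interpolates differently
iqr_default <- IQR(speed)

# type = 6 reproduces the hand calculation for this sample
iqr_type6 <- IQR(speed, type = 6)

print(c(by_hand = iqr_by_hand, default = iqr_default, type6 = iqr_type6))
```

Neither answer is wrong. Percentiles have several accepted definitions, and the differences shrink as the sample size grows.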
Point Estimation: Proportion
- Proportion: a type of mean for categorical data
- Often expressed as a percentage
- Useful for categorical responses
\hat{p} = \frac{\sum_{i=1}^n y_i}{n},
y_i =
\begin{cases}
1 & \text{if observation } i \text{ is in the category of interest} \\
0 & \text{otherwise}
\end{cases}
Point Estimation: Proportion
Find the proportion of ponies that have wings in the following sample: {Y, N, Y, Y, N, Y}
Count the number of “Y” responses and divide by total:
\hat{p} = \frac{\sum_{i=1}^n y_i}{n} = \frac{4}{6} \approx 0.667
- The proportion of ponies with wings is 0.667 (or 66.7%).
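The indicator-variable definition translates directly into base R. In this minimal sketch the names `wings` and `has_wings` are our own.

```r
# Sample of 6 ponies: does the pony have wings?
wings <- c("Y", "N", "Y", "Y", "N", "Y")

# Indicator variable: 1 if the pony has wings, 0 otherwise
has_wings <- as.integer(wings == "Y")

# Proportion as the mean of the 0/1 indicator
p_hat <- sum(has_wings) / length(has_wings)

# Shortcut: the mean of a logical vector is the proportion of TRUE values
p_hat_shortcut <- mean(wings == "Y")

print(round(p_hat, 3))
```

This is why a proportion can be described as a type of mean: it is the ordinary mean of a variable coded 0/1.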
Point Estimation: Frequency Table
- Frequency table: A table showing how often each value appears in a dataset.
- Useful for categorical responses.
- For each category, i, we report n_i (\%_i)
- Construct the frequency table for the following sample of 8 ponies: {Earth, Pegasus, Unicorn, Earth, Pegasus, Pegasus, Unicorn, Alicorn}
- Frequencies:
- Alicorn: n_{\text{A}} = 1
- Earth: n_{\text{E}} = 2
- Pegasus: n_{\text{P}} = 3
- Unicorn: n_{\text{U}} = 2
- Proportions:
- Alicorn: \hat{p}_{\text{A}} = 1/8 = 0.125
- Earth: \hat{p}_{\text{E}} = 2/8 = 0.250
- Pegasus: \hat{p}_{\text{P}} = 3/8 = 0.375
- Unicorn: \hat{p}_{\text{U}} = 2/8 = 0.250
Point Estimation: Frequency Table
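- Putting this into a table, with each category reported as n (%):
Pony type   n (%)
Alicorn     1 (12.5%)
Earth       2 (25.0%)
Pegasus     3 (37.5%)
Unicorn     2 (25.0%)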
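One way to produce such a frequency table in base R is sketched below. This is not necessarily how the course materials build it; the column names `n`, `percent`, and `n_pct` are our own.

```r
# Sample of 8 ponies from the example
pony_type <- c("Earth", "Pegasus", "Unicorn", "Earth",
               "Pegasus", "Pegasus", "Unicorn", "Alicorn")

# Counts per category (table() sorts the categories alphabetically)
counts <- table(pony_type)

# Proportions per category
props <- prop.table(counts)

# Combine into one table, reporting n (%) for each category
freq_table <- data.frame(
  pony_type = names(counts),
  n = as.vector(counts),
  percent = as.vector(100 * props)
)
freq_table$n_pct <- sprintf("%d (%.1f%%)", freq_table$n, freq_table$percent)

print(freq_table)
```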
Point Estimation: Contingency Table
Contingency table: A table that summarizes two qualitative variables and their overlap.
We will not concern ourselves with the derivation, but will rely on R.
Consider this data,
Point Estimation: Contingency Table
- The resulting contingency table would look something like this:
- We are using column totals as our denominators.
# A tibble: 4 × 3
  pony_type No        Yes
  <chr>     <chr>     <chr>
1 Alicorn   0 (0.0%)  1 (25.0%)
2 Earth     2 (50.0%) 0 (0.0%)
3 Pegasus   0 (0.0%)  3 (75.0%)
4 Unicorn   2 (50.0%) 0 (0.0%)
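A table with the same cell values can be sketched in base R as follows. Two assumptions are ours: that the No/Yes columns record whether the pony has wings, and that the data are the same 8 ponies used for the frequency table (the counts are consistent with both). The output above is a tibble, so the course code presumably uses tidyverse functions instead.

```r
# Sample of 8 ponies (ASSUMPTION: same sample as the frequency table example)
pony_type <- c("Earth", "Pegasus", "Unicorn", "Earth",
               "Pegasus", "Pegasus", "Unicorn", "Alicorn")

# ASSUMPTION: the second variable is "has wings", Yes for Pegasus and Alicorn
has_wings <- ifelse(pony_type %in% c("Pegasus", "Alicorn"), "Yes", "No")

# Cross-tabulate the two qualitative variables
counts <- table(pony_type, has_wings)

# Column percentages: each column total is the denominator (margin = 2)
col_pct <- 100 * prop.table(counts, margin = 2)

# Format each cell as n (%)
cells <- matrix(
  sprintf("%d (%.1f%%)", as.vector(counts), as.vector(col_pct)),
  nrow = nrow(counts),
  dimnames = dimnames(counts)
)

print(cells, quote = FALSE)
```

Using `margin = 1` instead would give row percentages, and dividing by `sum(counts)` would give overall percentages.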
Graphs: Box Plots
Box plots display the distribution of a continuous variable using the five-number summary:
- Whisker: Minimum
- Beginning of box: 25th percentile (first quartile; Q1, P25)
- “Middle” of box: Median (50th percentile, second quartile; Q2, P50)
- End of box: 75th percentile (third quartile; Q3, P75)
- Whisker: Maximum
We use box and whisker plots to get an idea of the spread and skewness of the data.
Note: there are different ways to define the whiskers.
- I use the min/max as whiskers when sketching by hand.
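- ggplot() (via geom_boxplot()) extends each whisker to the most extreme observation within 1.5 \times IQR of the box and plots anything beyond that as an individual point.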
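The two whisker conventions can be compared numerically with base R, as a sketch. `fivenum()` gives the min/max version, and `boxplot.stats()` applies the 1.5 \times IQR rule (its `coef` argument defaults to 1.5). Base R is used here instead of ggplot2 so the example runs without extra packages.

```r
# Flying speeds (km/hr) of the 5 ponies from the earlier examples
speed <- c(10, 20, 30, 40, 100)

# Five-number summary: min, lower hinge, median, upper hinge, max
five_num <- fivenum(speed)

# Box plot statistics with whiskers limited to 1.5 x IQR beyond the box
bp <- boxplot.stats(speed, coef = 1.5)

whisker_and_box <- bp$stats  # lower whisker, box start, median, box end, upper whisker
flagged_points <- bp$out     # observations beyond the whiskers

print(five_num)
print(whisker_and_box)
print(flagged_points)
```

With min/max whiskers the upper whisker reaches 100. With the 1.5 \times IQR rule it stops at 40, and 100 is drawn as a separate point. The box edges computed here (20 and 40) also differ from the hand-computed P_{25} = 15 and P_{75} = 70, because software uses a different percentile rule.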
Graphs: Histograms
Histograms show the distribution of a continuous variable.
- What is the shape of the distribution?
- Is the distribution symmetric? Skewed? How skewed?
Values are grouped into intervals (“bins”); the height of each bar shows how many values fall into that interval.
This allows us to quickly see if there are any oddities.
- Increased proportion of a specific value/bin.
- Zero inflation? Value used to indicate missing?
- Any values that are “out in the tail”.
- Outlier? Data entry error?
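The binning step can be sketched in base R with `hist()`. The bin width of 20 km/hr is our own choice for illustration, and the data are the five flying speeds from the earlier examples.

```r
# Flying speeds (km/hr) of the 5 ponies from the earlier examples
speed <- c(10, 20, 30, 40, 100)

# ASSUMPTION: bins of width 20 km/hr, chosen only for illustration
bins <- seq(0, 100, by = 20)

# Compute the bins without drawing; by default each bin includes its right endpoint
h <- hist(speed, breaks = bins, plot = FALSE)

print(h$breaks)  # bin edges
print(h$counts)  # number of values in each bin

# Draw the histogram from the computed bins
plot(h, main = "Flying speed", xlab = "Flying speed (km/hr)")
```

The two empty bins between 40 and 80 make the value of 100 stand out as a point “out in the tail”.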
Graphs: Bar Graphs
- Bar graphs display the distribution of categorical data.
- The frequency or proportion of observations is displayed on the bar graph.
- Bar graphs usually have categories on the x-axis and counts or proportions on the y-axis.
- Note that we could flip the axes to create a horizontal bar graph.
- Note that the bars are separated on the x-axis to indicate the lack of continuity.
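A minimal base R sketch of these points follows. It uses the 8-pony sample from the frequency table example and `barplot()` rather than ggplot2, so that no extra packages are needed; axis labels are our own.

```r
# Sample of 8 ponies from the frequency table example
pony_type <- c("Earth", "Pegasus", "Unicorn", "Earth",
               "Pegasus", "Pegasus", "Unicorn", "Alicorn")

counts <- table(pony_type)

# Categories on the x-axis, counts on the y-axis; bars are separated by default
barplot(counts, xlab = "Pony type", ylab = "Number of ponies")

# Same data with the axes flipped (horizontal bars)
barplot(counts, horiz = TRUE, xlab = "Number of ponies", ylab = "Pony type")

# Proportions instead of counts on the y-axis
barplot(prop.table(counts), xlab = "Pony type", ylab = "Proportion of ponies")
```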
Graphs: Bar Graphs
- Consider the bar graph, below.
Graphs: Side-by-Side Bar Graphs
- Consider the bar graph, below.
Graphs: Stacked Bar Graphs
- Consider the bar graph, below.
Graphs: Histograms vs Bar Graphs
We have now reviewed two “bar style” graphs that we see regularly: histograms and bar graphs.
We use histograms to see the distribution of continuous variables.
- The x-axis represents numeric intervals.
- The bars touch each other to represent continuity.
We use bar graphs to see the distribution of categorical variables.
- The x-axis represents categories.
- The bars do not touch each other, implying distinct categories.
Graphs: Scatterplots
- Scatterplots allow us to look at the relationship between two continuous variables.
- Each point on the graph represents one observation.
- What statisticians use scatterplots for:
- Explore patterns (aka trends or relationships).
- Linear relationships.
- Non-linear relationships.
- Detect clusters of observations.
- Find oddities in the data (outliers).
- When we describe the relationship, we are really answering the question, “As x increases, what happens to y?”
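A minimal base R sketch of a scatterplot follows. The values are hypothetical, made up only to show the mechanics; they are not from the pony dataset. The correlation coefficient is included as one numerical summary of a linear relationship, which the slides do not cover here.

```r
# HYPOTHETICAL values, made up purely for illustration
friendship   <- c(2, 4, 5, 7, 9)            # harmony index (0-10)
tail_shimmer <- c(110, 150, 160, 210, 240)  # reflected light (lux)

# Each point is one pony: x = friendship, y = tail shimmer
plot(friendship, tail_shimmer,
     xlab = "Friendship (harmony index)",
     ylab = "Tail shimmer (lux)")

# Pearson correlation: sign gives direction, magnitude gives linear strength
r <- cor(friendship, tail_shimmer)
print(r)
```

In these made-up data, tail shimmer tends to increase as friendship increases, so the relationship would be described as positive and roughly linear.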
Graphs: Scatterplots
- Consider the scatterplot, below.
Wrap Up
- We have covered (“reminded” ourselves of) a lot today!
- Always remember that I do not expect you to:
- Memorize code.
- Produce code in a timed environment.
- Automatically know how to do these things.
- I do expect you to:
- Use your resources (lecture slides, GitHub website, Discord).
- Try your best.
Wrap Up
- Today’s lecture:
- Basic summarization of data.
- Basic data visualization.
- This week’s lab:
- Summarizing data
- Visualizing data
- Next week:
- Review of statistical inference.
- Confidence intervals and hypothesis tests.
- One sample means.
- Two sample means.
- Independent data.
- Dependent data.
Wrap Up
- Daily activity: the .qmd we worked on during class.
- Due date: Monday, June 23, 2025.
- You will upload the resulting .html file on Canvas.
- Please refer to the help guide on the Biostat website if you need help with submission.